{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# 第8章 提升方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 习题8.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "&emsp;&emsp;某公司招聘职员考查身体、业务能力、发展潜力这3项。身体分为合格1、不合格0两级，业务能力和发展潜力分为上1、中2、下3三级。分类为合格1 、不合格-1两类。已知10个人的数据，如下表所示。假设弱分类器为决策树桩。试用AdaBoost算法学习一个强分类器。  \n",
    "\n",
    "应聘人员情况数据表\n",
    "\n",
    "&emsp;&emsp;|1|2|3|4|5|6|7|8|9|10\n",
    "-|-|-|-|-|-|-|-|-|-|-\n",
    "身体|0|0|1|1|1|0|1|1|1|0\n",
    "业务|1|3|2|1|2|1|1|1|3|2\n",
    "潜力|3|1|2|3|3|2|2|1|1|1\n",
    "分类|-1|-1|-1|-1|-1|-1|1|1|-1|-1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**\n",
    "1. 列出AdaBoost算法；\n",
    "2. 采用sklearn的AdaBoostClassifier分类器，构建并训练得到强分类器；\n",
    "3. 自编程实现AdaBoost算法，并训练得到强分类器。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：提升方法AdaBoost算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第8章的算法8.1：\n",
    "\n",
    "> **算法8.1（AdaBoost）**  \n",
    "> 输入：训练数据集$T=\\{(x_1,y_1), (x_2,y_2), \\cdots, (x_N,y_N)\\}$，其中$x_i \\in \\mathcal{X} \\subseteq R^n$，$y_i \\in \\mathcal{Y} = \\{-1, +1\\}$；弱学习算法；  \n",
    "输出：最终分类器$G(x)$。  \n",
    "（1）初始化训练数据的权值分布\n",
    "> $$ D_1 = (w_{11}, \\cdots, w_{1i}, \\cdots, w_{1N}), \\quad w_{1i} = \\frac{1}{N}, \\quad i = 1, 2, \\cdots, N $$  \n",
    ">（2）对$m=1, 2, \\cdots, M$  \n",
    "&emsp;&emsp;（a）使用具有权值分布$D_m$的训练数据集学习，得到基本分类器\n",
    "> $$ G_m(x):\\mathcal{X} \\rightarrow \\{-1, +1\\} $$\n",
    "> &emsp;&emsp;（b）计算$G_m(x)$在训练数据集上的分类误差率\n",
    "> $$ e_m = \\sum_{i=1}^N P(G_m(x_i) \\neq y_i) = \\sum_{i=1}^N w_{mi}I(G_m(x_i) \\neq y_i) $$\n",
    "> &emsp;&emsp;（c）计算$G_m(x)$的系数\n",
    "> $$ \\alpha_m = \\frac{1}{2} \\log \\frac{1 - e_m}{e_m} $$\n",
    "> &emsp;&emsp;这里的对数是自然对数。  \n",
    "&emsp;&emsp;（d）更新训练数据集的权值分布\n",
    "> $$ D_{m+1} = (w_{m+1,1}, \\cdots, w_{m+1, i}, \\cdots, w_{m+1,N}) \\\\\n",
    "w_{m+1,i} = \\frac{w_{mi}}{Z_m} \\exp(-\\alpha_m y_i G_m(x_i)), \\quad i = 1, 2, \\cdots, N $$\n",
    "> &emsp;&emsp;这里，$Z_m$是规范化因子\n",
    "> $$ Z_m = \\sum_{i=1}^N w_{mi} \\exp(-\\alpha_m y_i G_m(x_i)) $$\n",
    "> &emsp;&emsp;它使$D_{m+1}$成为一个概率分布。  \n",
    "（3）构建基本分类器的线性组合\n",
    "> $$ f(x) = \\sum_{m=1}^M \\alpha_m G_m(x) $$\n",
    "> 得到最终分类器\n",
    "> $$ \\begin{aligned}\n",
    "G(x) &= \\text{sign}(f(x)) \\\\\n",
    "&= \\text{sign}\\left(\\sum_{m=1}^M \\alpha_m G_m(x) \\right)\n",
    "\\end{aligned} $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：采用AdaBoostClassifier分类器实现**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据题目要求弱分类器采用决策树，通过sklearn的AdaBoostClassifier类，构建分类器，由于AdaBoostClassifier分类器默认采用CART决策树弱分类器，故不需要设置base_estimator参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "原始输出: [-1 -1 -1 -1 -1 -1  1  1 -1 -1]\n",
      "预测输出: [-1 -1 -1 -1 -1 -1  1  1 -1 -1]\n",
      "预测准确率：100.00%\n"
     ]
    }
   ],
   "source": [
    "from sklearn.ensemble import AdaBoostClassifier\n",
    "import numpy as np\n",
    "\n",
    "# 加载训练数据\n",
    "X = np.array([[0, 1, 3],\n",
    "              [0, 3, 1],\n",
    "              [1, 2, 2],\n",
    "              [1, 1, 3],\n",
    "              [1, 2, 3],\n",
    "              [0, 1, 2],\n",
    "              [1, 1, 2],\n",
    "              [1, 1, 1],\n",
    "              [1, 3, 1],\n",
    "              [0, 2, 1]\n",
    "              ])\n",
    "y = np.array([-1, -1, -1, -1, -1, -1, 1, 1, -1, -1])\n",
    "\n",
    "# 使用sklearn的AdaBoostClassifier分类器\n",
    "clf = AdaBoostClassifier()\n",
    "# 进行分类器训练\n",
    "clf.fit(X, y)\n",
    "# 对数据进行预测\n",
    "y_predict = clf.predict(X)\n",
    "# 得到分类器的预测准确率\n",
    "score = clf.score(X, y)\n",
    "print(\"原始输出:\", y)\n",
    "print(\"预测输出:\", y_predict)\n",
    "print(\"预测准确率：{:.2%}\".format(score))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：自编程实现AdaBoost算法**\n",
    "\n",
    "代码思路：\n",
    "1. 写出fit函数，即分类器训练函数；\n",
    "2. 根据书中第158页例8.1，编写build_stump函数，用于得到分类误差最低的基本分类器；\n",
    "3. 根据算法第2步(a)~(c)，编写代码；\n",
    "4. 根据算法第2步(d)，编写updata_w函数，用于更新训练数据集的权值分布；\n",
    "5. 编写predict函数，用于预测数据；\n",
    "6. 【附加】编写score函数，用于计算分类器的预测准确率；\n",
    "7. 【附加】编写print_G函数，用于打印最终分类器。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "class MyAdaBoost:\n",
    "    def __init__(self, tol=0.05, max_iter=10):\n",
    "        # 特征\n",
    "        self.X = None\n",
    "        # 标签\n",
    "        self.y = None\n",
    "        # 分类误差小于精度时，分类器训练中止\n",
    "        self.tol = tol\n",
    "        # 最大迭代次数\n",
    "        self.max_iter = max_iter\n",
    "        # 权值分布\n",
    "        self.w = None\n",
    "        # 弱分类器集合\n",
    "        self.G = []\n",
    "\n",
    "    def build_stump(self):\n",
    "        \"\"\"\n",
    "        以带权重的分类误差最小为目标，选择最佳分类阈值，得到最佳的决策树桩\n",
    "        best_stump['dim'] 合适特征的所在维度\n",
    "        best_stump['thresh']  合适特征的阈值\n",
    "        best_stump['ineq']  树桩分类的标识lt,rt\n",
    "        \"\"\"\n",
    "        m, n = np.shape(self.X)\n",
    "        # 分类误差\n",
    "        min_error = np.inf\n",
    "        # 小于分类阈值的样本所属的标签类别\n",
    "        sign = None\n",
    "        # 最优决策树桩\n",
    "        best_stump = {}\n",
    "        for i in range(n):\n",
    "            # 求每一种特征的最小值和最大值\n",
    "            range_min = self.X[:, i].min()\n",
    "            range_max = self.X[:, i].max()\n",
    "            step_size = (range_max - range_min) / n\n",
    "            for j in range(-1, int(n) + 1):\n",
    "                # 根据n的值，构造切分点\n",
    "                thresh_val = range_min + j * step_size\n",
    "                # 计算左子树和右子树的误差\n",
    "                for inequal in ['lt', 'rt']:\n",
    "                    # (a)得到基本分类器\n",
    "                    predict_values = self.base_estimator(self.X, i, thresh_val, inequal)\n",
    "                    # (b)计算在训练集上的分类误差率\n",
    "                    err_arr = np.array(np.ones(m))\n",
    "                    err_arr[predict_values.T == self.y.T] = 0\n",
    "                    weighted_error = np.dot(self.w, err_arr)\n",
    "                    if weighted_error < min_error:\n",
    "                        min_error = weighted_error\n",
    "                        sign = predict_values\n",
    "                        best_stump['dim'] = i\n",
    "                        best_stump['thresh'] = thresh_val\n",
    "                        best_stump['ineq'] = inequal\n",
    "        return best_stump, sign, min_error\n",
    "\n",
    "    def updata_w(self, alpha, predict):\n",
    "        \"\"\"\n",
    "        更新样本权重w\n",
    "        :param alpha: alpha\n",
    "        :param predict: yi\n",
    "        :return:\n",
    "        \"\"\"\n",
    "        # (d)根据迭代公式，更新权值分布\n",
    "        P = self.w * np.exp(-alpha * self.y * predict)\n",
    "        self.w = P / P.sum()\n",
    "\n",
    "    @staticmethod\n",
    "    def base_estimator(X, dimen, thresh_val, thresh_ineq):\n",
    "        \"\"\"\n",
    "        计算单个弱分类器（决策树桩）预测输出\n",
    "        :param X: 特征\n",
    "        :param dimen: 特征的位置（即第几个特征）\n",
    "        :param thresh_val: 切分点\n",
    "        :param thresh_ineq: 标记结点的位置，可取左子树(lt)，右子树(rt)\n",
    "        :return: 返回预测结果矩阵\n",
    "        \"\"\"\n",
    "        # 预测结果矩阵\n",
    "        ret_array = np.ones(np.shape(X)[0])\n",
    "        # 左叶子 ，整个矩阵的样本进行比较赋值\n",
    "        if thresh_ineq == 'lt':\n",
    "            ret_array[X[:, dimen] >= thresh_val] = -1.0\n",
    "        else:\n",
    "            ret_array[X[:, dimen] < thresh_val] = -1.0\n",
    "        return ret_array\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        \"\"\"\n",
    "        对分类器进行训练\n",
    "        \"\"\"\n",
    "        self.X = X\n",
    "        self.y = y\n",
    "        # （1）初始化训练数据的权值分布\n",
    "        self.w = np.full((X.shape[0]), 1 / X.shape[0])\n",
    "        G = 0\n",
    "        # （2）对m=1,2,...,M进行遍历\n",
    "        for i in range(self.max_iter):\n",
    "            # (b)得到Gm(x)的分类误差error，获取当前迭代最佳分类阈值sign\n",
    "            best_stump, sign, error = self.build_stump()\n",
    "            # (c)计算弱分类器Gm(x)的系数\n",
    "            alpha = 1 / 2 * np.log((1 - error) / error)\n",
    "            # 弱分类器Gm(x)权重\n",
    "            best_stump['alpha'] = alpha\n",
    "            # 保存弱分类器Gm(x)，得到分类器集合G\n",
    "            self.G.append(best_stump)\n",
    "            # 计算当前总分类器（之前所有弱分类器加权和）误差率\n",
    "            G += alpha * sign\n",
    "            y_predict = np.sign(G)\n",
    "            # 使用MAE计算误差\n",
    "            error_rate = np.sum(np.abs(y_predict - self.y)) / self.y.shape[0]\n",
    "            if error_rate < self.tol:\n",
    "                # 满足中止条件，则跳出循环\n",
    "                print(\"迭代次数：{}次\".format(i + 1))\n",
    "                break\n",
    "            else:\n",
    "                # (d)更新训练数据集的权值分布\n",
    "                self.updata_w(alpha, alpha * sign)\n",
    "\n",
    "    def predict(self, X):\n",
    "        \"\"\"对新数据进行预测\"\"\"\n",
    "        m = np.shape(X)[0]\n",
    "        G = np.zeros(m)\n",
    "        for i in range(len(self.G)):\n",
    "            stump = self.G[i]\n",
    "            # 遍历每一个弱分类器，进行加权\n",
    "            _G = self.base_estimator(X, stump['dim'], stump['thresh'], stump['ineq'])\n",
    "            alpha = stump['alpha']\n",
    "            # (3)构建基本分类器的线性组合\n",
    "            G += alpha * _G\n",
    "        # 计算最终分类器的预测结果\n",
    "        y_predict = np.sign(G)\n",
    "        return y_predict.astype(int)\n",
    "\n",
    "    def score(self, X, y):\n",
    "        \"\"\"计算分类器的预测准确率\"\"\"\n",
    "        y_predict = self.predict(X)\n",
    "        # 使用MAE计算误差\n",
    "        error_rate = np.sum(np.abs(y_predict - y)) / y.shape[0]\n",
    "        return 1 - error_rate\n",
    "\n",
    "    def print_G(self):\n",
    "        i = 1\n",
    "        s = \"G(x) = sign[f(x)] = sign[\"\n",
    "        for stump in self.G:\n",
    "            if i != 1:\n",
    "                s += \" + \"\n",
    "            s += \"{}·G{}(x)\".format(round(stump['alpha'], 4), i)\n",
    "            i += 1\n",
    "        s += \"]\"\n",
    "        return s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "迭代次数：5次\n",
      "原始输出: [-1 -1 -1 -1 -1 -1  1  1 -1 -1]\n",
      "预测输出: [-1 -1 -1 -1 -1 -1  1  1 -1 -1]\n",
      "预测正确率：100.00%\n",
      "最终分类器G(x)为: G(x) = sign[f(x)] = sign[0.6931·G1(x) + 0.6133·G2(x) + 0.4032·G3(x) + 0.4681·G4(x) + 0.3736·G5(x)]\n"
     ]
    }
   ],
   "source": [
    "# 加载训练数据\n",
    "X = np.array([[0, 1, 3],\n",
    "              [0, 3, 1],\n",
    "              [1, 2, 2],\n",
    "              [1, 1, 3],\n",
    "              [1, 2, 3],\n",
    "              [0, 1, 2],\n",
    "              [1, 1, 2],\n",
    "              [1, 1, 1],\n",
    "              [1, 3, 1],\n",
    "              [0, 2, 1]\n",
    "              ])\n",
    "y = np.array([-1, -1, -1, -1, -1, -1, 1, 1, -1, -1])\n",
    "\n",
    "clf = MyAdaBoost()\n",
    "clf.fit(X, y)\n",
    "y_predict = clf.predict(X)\n",
    "score = clf.score(X, y)\n",
    "print(\"原始输出:\", y)\n",
    "print(\"预测输出:\", y_predict)\n",
    "print(\"预测正确率：{:.2%}\".format(score))\n",
    "print(\"最终分类器G(x)为:\", clf.print_G())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题8.2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;比较支持向量机、 AdaBoost 、逻辑斯谛回归模型的学习策略与算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**\n",
    "1. 列出支持向量机的学习策略与学习算法\n",
    "2. 列出AdaBoost的学习策略与学习算法\n",
    "3. 列出逻辑斯谛回归模型的学习策略与学习算法\n",
    "4. 比较三者的学习策略与算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：支持向量机的学习策略与算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第7.2.4节的合页损失函数\n",
    "\n",
    "> &emsp;&emsp;对于线性支持向量机学习来说，其模型为分离超平面$w^* \\cdot x + b^* = 0$及决策函数$f(x)=\\text{sign}(w^* \\cdot x + b^*)$，其学习策略为软间隔最大化，学习算法为凸二次规划。  \n",
    "> &emsp;&emsp;线性支持向量机学习还有另外一种解释，就是最小化一下目标函数：\n",
    "> $$ \\sum_{i=1}^N [1 - y_i(w \\cdot x_i + b)]_+ + \\lambda \\|w\\|^2 $$\n",
    "> 目标函数的第1项是经验损失或经验风险，函数\n",
    "> $$ L(y(w \\cdot b + x)) = [1 - y_i(w \\cdot x_i + b)]_+ $$\n",
    "> 被称为合页损失函数，第2项是系数为$\\lambda$的$w$的$L_2$范数，是正则化项。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第7.4节的序列最小最优化算法\n",
    "\n",
    "> &emsp;&emsp;SMO算法是一种启发式算法，其基本思路是：如果所有变量的解都满足此最优化问题的KKT条件，那么这个最优化问题的解就得到了。因为KKT条件是该最优化问题的充分必要条件。  \n",
    "> &emsp;&emsp;整个SMO算法包括两个部分：求解两个变量二次规划的解析方法和选择变量的启发式方法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "综上所述：\n",
    "1. 支持向量机的学习策略：软间隔最大化、最小化由合页损失函数和正则化项组成的目标函数\n",
    "2. 支持向量机的学习算法：凸二次规划、SMO算法（序列最小最优化算法）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：AdaBoost的学习策略与算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第8.3节的AdbBoost算法的解释\n",
    "\n",
    "> &emsp;&emsp;AdaBoost算法还有另一个解释，即可认为AdaBoost算法是模型为加法模型、损失函数为指数函数、学习算法为前向分步算法时的二类分类学习方法。  \n",
    "> &emsp;&emsp;给定训练数据及损失函数$L(y,f(x))$的条件下，学习加法模型$f(x)$成为经验风险极小化即损失函数极小化问题：\n",
    "> $$ \\min \\limits_{\\beta_m ,\\gamma_m} \\sum_{i=1}^N L \\left(y_i, \\sum_{m=1}^M \\beta_m b(x_i;\\gamma_m) \\right) $$\n",
    ">\n",
    "> &emsp;&emsp;定理8.3 AdaBoost算法是前向分步加法算法的特例。这时，模型是由基本分类器组成的加法模型，损失函数是指数函数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "综上所述：\n",
    "1. AdaBoost的学习策略：极小化通过加法模型组成的指数损失函数\n",
    "2. AdaBoost的学习算法：学习加法模型的前向分步算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：逻辑斯谛回归模型的学习策略与算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第6.1.3节的模型参数估计：\n",
    "\n",
    "> &emsp;&emsp;逻辑斯谛回归模型学习时，对于给定的训练数据集$T=\\{(x_1,y_1), (x_2,y_2), \\cdots, (x_N,y_N)\\}$，其中$x_i \\in R^n$，$y_i \\in \\{0, 1\\}$，可以应用极大似然估计法估计模型参数，从而得到逻辑斯谛回归模型。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第6.3节的模型学习的最优化算法：\n",
    "\n",
    "> &emsp;&emsp;逻辑斯谛回归模型、最大熵模型学习归结为以似然函数为目标函数的最优化问题，通常通过迭代算法求解。常用的方法有改进的迭代尺度法、梯度下降法、牛顿法或拟牛顿法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "综上所述：\n",
    "1. 逻辑斯谛回归模型的学习策略：极大似然估计法\n",
    "2. 逻辑斯谛回归模型的学习算法：改进的迭代尺度法、梯度下降法、牛顿法或拟牛顿法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第4步：比较支持向量机、 AdaBoost 、逻辑斯谛回归模型的学习策略与算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| &emsp;&emsp; | 学习策略 | 算法 |\n",
    "| :---: | :---- | :---- |\n",
    "| 支持向量机 | 软间隔最大化、最小化由合页损失函数和正则化项组成的目标函数 | 凸二次规划、SMO算法（序列最小最优化算法） |\n",
    "| AdaBoost | 极小化通过加法模型组成的指数损失函数 | 学习加法模型的前向分步算法 |\n",
    "| 逻辑斯谛回归 | 极大似然估计法 | 改进的迭代尺度法、梯度下降法、牛顿法或拟牛顿法 |"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
